

Working Paper No. 89

On the Asymptotic Behavior of k-means
by

James B. MacQueen
November 1965




Western  Management  Science  Institute 
University  of  California  •  Los  Angeles 


University  of  California 


Los  Angeles 

Western  Management  Science  Institute 


Working Paper No. 89

On  The  Asymptotic  Behavior  of  k-means 
by 

James  B.  MacQueen 
November  1965 


This work was supported by the Western Management Science Institute under a grant from the Ford Foundation, and by the Office of Naval Research under Contract No. 233(75), Task No. 047-041. Reproduction in whole or in part is permitted for any purpose of the United States Government.


ON  THE  ASYMPTOTIC  BEHAVIOR 
OF  K-MEANS 


J.  MacQueen 

University of California, Los Angeles

1. Introduction. Let $z_1, z_2, \ldots$ be a random sequence of points (vectors) in $E_N$, each point being selected independently of the preceding ones using a fixed probability measure $p$. Thus $P[z_1 \in A] = p(A)$ and $P[z_{n+1} \in A \mid z_1, z_2, \ldots, z_n] = p(A)$, $n = 1, 2, \ldots$, for $A$ any measurable set in $E_N$. Relative to a given k-tuple $x = (x_1, x_2, \ldots, x_k)$, $x_i \in E_N$, $i = 1, 2, \ldots, k$, we define a minimum distance partition $S(x) = \{S_1(x), S_2(x), \ldots, S_k(x)\}$ of $E_N$ by

$$S_1(x) = T_1(x), \quad S_2(x) = T_2(x)\,S_1'(x), \quad \ldots, \quad S_k(x) = T_k(x)\,S_1'(x)\,S_2'(x) \cdots S_{k-1}'(x),$$

where a prime denotes the complement, juxtaposition denotes intersection, and

$$T_i(x) = \{\xi : \xi \in E_N,\ |\xi - x_i| \le |\xi - x_j|,\ j = 1, 2, \ldots, k\}.$$

The set $S_i(x)$ contains the points in $E_N$ nearest to $x_i$, with tied points being assigned arbitrarily to the set of lower index. Note that with this convention concerning tied points, if $x_i = x_j$ and $i < j$ then $S_j(x) = \emptyset$. Sample k-means $x^n = (x_1^n, x_2^n, \ldots, x_k^n)$, $x_i^n \in E_N$, $i = 1, \ldots, k$, with associated integer weights $(w_1^n, w_2^n, \ldots, w_k^n)$, are now defined as follows:

$$x_i^1 = z_i, \quad w_i^1 = 1, \quad i = 1, 2, \ldots, k,$$

and for $n = 1, 2, \ldots$, if $z_{k+n} \in S_i^n$,

$$x_i^{n+1} = (x_i^n w_i^n + z_{n+k})/(w_i^n + 1), \quad w_i^{n+1} = w_i^n + 1, \quad \text{and} \quad x_j^{n+1} = x_j^n, \quad w_j^{n+1} = w_j^n \ \text{ for } j \ne i,$$

where $S^n = \{S_1^n, S_2^n, \ldots, S_k^n\}$ is the minimum distance partition relative to $x^n$.

We investigate the asymptotic behavior of the k-means, making the special assumptions (i), $p$ is absolutely continuous with respect to Lebesgue measure on $E_N$, and (ii), $p(R) = 1$ for a closed and bounded convex set $R \subset E_N$, and $p(A) > 0$ for every open set $A \subset R$. For a given k-tuple $x = (x_1, x_2, \ldots, x_k)$ -- such an entity being referred to hereafter as a k-point -- let

$$W(x) = \sum_{i=1}^{k} \int_{S_i} |z - x_i|^2 \, dp(z),$$

$$V(x) = \sum_{i=1}^{k} \int_{S_i} |z - u_i(x)|^2 \, dp(z),$$

where $S = \{S_1, S_2, \ldots, S_k\}$ is the minimum distance partition relative to $x$, and $u_i(x) = \int_{S_i} z \, dp(z)/p(S_i)$ or $u_i(x) = x_i$ according as $p(S_i) > 0$ or $p(S_i) = 0$. If $x_i = u_i(x)$, $i = 1, 2, \ldots, k$, we say the k-point $x$ is unbiased.
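The recursive definition of the sample k-means given earlier in this section can be sketched in Python as follows. This is a minimal illustration rather than code from the paper: the names `kmeans_step` and `sequential_kmeans` and the use of NumPy are assumptions made here. `np.argmin` returns the first minimizing index, which happens to match the convention that tied points go to the set of lower index.

```python
import numpy as np

def kmeans_step(means, weights, z):
    """Assign z to its nearest mean and update that mean and its weight in place."""
    d2 = np.sum((means - z) ** 2, axis=1)
    i = int(np.argmin(d2))  # first minimiser, so ties go to the lower index
    means[i] = (means[i] * weights[i] + z) / (weights[i] + 1)
    weights[i] += 1
    return i

def sequential_kmeans(points, k):
    """Run the update over points z_1, z_2, ...; the first k points seed the means."""
    pts = np.asarray(points, dtype=float)
    means = pts[:k].copy()            # x_i^1 = z_i
    weights = np.ones(k, dtype=int)   # w_i^1 = 1
    for z in pts[k:]:
        kmeans_step(means, weights, z)
    return means, weights

# Small illustrative data set on the line (hypothetical values).
means, weights = sequential_kmeans([[0.0], [10.0], [1.0], [9.0], [2.0]], k=2)
```

Each mean is the plain average of the points assigned to it so far, so the weighted sum of the means always equals the sum of all points processed.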

The principal result is

Theorem 1. The sequence of random variables $W(x^1), W(x^2), \ldots$ converges a.s. and $W_\infty = \lim_{n \to \infty} W(x^n)$ is a.s. equal to $V(x)$ for some $x$ in the class of k-points $x = (x_1, x_2, \ldots, x_k)$ which are unbiased, and have the property that $x_i \ne x_j$ if $i \ne j$.

In lieu of a satisfactory strong law of large numbers for k-means, we obtain

Theorem 2. $\sum_{n=1}^{m} \left( \sum_{i=1}^{k} p_i^n |x_i^n - u_i^n| \right)/m \to 0$ a.s. as $m \to \infty$, where $u_i^n = u_i(x^n)$ and $p_i^n = p(S_i(x^n))$.
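For a concrete feel for $W$, $V$ and the quantity averaged in Theorem 2, the sketch below treats the simplest case where everything is available in closed form: $p$ uniform on $[0, 1]$ (so $N = 1$). The choice of distribution and the function names `cells_uniform01`, `W_V` and `running_bias_average` are assumptions made for illustration only. The cells are intervals bounded by midpoints, $p(S_i)$ is the interval length, and $u_i(x)$ is the interval midpoint.

```python
import numpy as np

def cells_uniform01(x):
    """End points (a_i, b_i) of the cell S_i(x) inside [0, 1] for points x_i on the line."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x)
    xs = x[order]
    mids = (xs[:-1] + xs[1:]) / 2
    edges = np.clip(np.concatenate(([0.0], mids, [1.0])), 0.0, 1.0)
    a = np.empty_like(x)
    b = np.empty_like(x)
    a[order] = edges[:-1]
    b[order] = edges[1:]
    return a, b

def W_V(x):
    """Return W(x), V(x), the cell probabilities p_i and the conditional means u_i."""
    x = np.asarray(x, dtype=float)
    a, b = cells_uniform01(x)
    p = b - a
    u = np.where(p > 0, (a + b) / 2, x)   # u_i(x) = x_i when p(S_i) = 0
    W = np.sum(((b - x) ** 3 - (a - x) ** 3) / 3)
    V = np.sum(((b - u) ** 3 - (a - u) ** 3) / 3)
    return W, V, p, u

def running_bias_average(m, k=2, seed=0):
    """Average over n = 1..m of sum_i p_i^n |x_i^n - u_i^n| along one simulated run."""
    rng = np.random.default_rng(seed)
    z = rng.random(m + k)
    x = z[:k].copy()
    w = np.ones(k)
    total = 0.0
    for n in range(m):
        _, _, p, u = W_V(x)
        total += np.sum(p * np.abs(x - u))
        i = int(np.argmin(np.abs(x - z[k + n])))
        x[i] = (x[i] * w[i] + z[k + n]) / (w[i] + 1)
        w[i] += 1
    return total / m

W, V, p, u = W_V([0.25, 0.75])   # an unbiased 2-point: W = V = 1/48
```

For any k-point the function also lets one check numerically that $W(x) = V(x) + \sum_i p(S_i(x)) |x_i - u_i(x)|^2$, so $W(x) = V(x)$ exactly when the probability-weighted bias vanishes.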

Potential applications of the k-means concept, which will be discussed in detail elsewhere, occur in certain taxonomy problems, in connection with coding and pattern recognition problems, in the description of categorizing behavior, and in connection with the problem of locating partitions with minimum average variance [5]. (See Box [1] and Ward [6] for related results.)

2. Proofs. The system of k-points forms a complete metric space if the distance $\rho(x, y)$ between the k-points $x = (x_1, x_2, \ldots, x_k)$ and $y = (y_1, y_2, \ldots, y_k)$ is defined by $\rho(x, y) = \sum_{i=1}^{k} d(x_i, y_i)$, where $d(a, b)$ is the Euclidean distance between $a$ and $b$. We designate this space by $M$ and interpret continuity, limits, convergence, neighborhoods, etc., in the usual way with respect to the metric topology of $M$. Of course, every bounded sequence of k-points contains a convergent subsequence.

Certain difficulties encountered in the proof of Theorem 1 are caused by the possibility of the limit of a convergent sequence of k-points having some of its constituent points equal to each other. With the end in view of circumventing these difficulties, suppose that for a given k-point $x = (x_1, x_2, \ldots, x_k)$, $x_i \in R$, $i = 1, 2, \ldots, k$, we have $x_i = x_j$ for a certain pair $i, j$, $i < j$, and $x_i = x_j \ne x_l$ for $l \ne i, j$. The points $x_i$ and $x_j$ being distinct in this way, and considering assumption (ii), we necessarily have $p(S_i(x)) > 0$, for $S_i(x)$ certainly contains an open subset of $R$. The convention concerning tied points means $p(S_j(x)) = 0$. Now if $\{y^n\} = \{(y_1^n, y_2^n, \ldots, y_k^n)\}$ is a sequence of k-points satisfying $y_i^n \in R$, and $y_i^n \ne y_j^n$ if $i \ne j$, $n = 1, 2, \ldots$, and the sequence $y^n$ approaches $x$, then $y_i^n$ and $y_j^n$ approach $x_i = x_j$, and hence each other; they also approach the boundaries of $S_i(y^n)$ and $S_j(y^n)$ in the vicinity of $x_i$. The conditional means $u_i(y^n)$ and $u_j(y^n)$, however, must remain in the interior of the sets $S_i(y^n)$ and $S_j(y^n)$ respectively, and thus tend to become separated from the corresponding points $y_i^n$ and $y_j^n$. In fact, for each sufficiently large $n$, the distance of $u_i(y^n)$ from the boundary of $S_i(y^n)$ or the distance of $u_j(y^n)$ from the boundary of $S_j(y^n)$ will exceed a certain positive number. For as $n$ tends to infinity, $p(S_i(y^n)) + p(S_j(y^n))$ will approach $p(S_i(x)) > 0$ -- a simple continuity argument based on the absolute continuity of $p$ will establish this -- and for each sufficiently large $n$, at least one of the probabilities $p(S_i(y^n))$ or $p(S_j(y^n))$ will be positive by a definite amount, say $\delta$. But in view of the boundedness of $R$, a convex set of $p$ measure at least $\delta > 0$ cannot have its conditional mean arbitrarily near its boundary. This line of reasoning, which extends immediately to the case where some three or more members of $(x_1, x_2, \ldots, x_k)$ are equal, gives us

Lemma 1. Let $x = (x_1, x_2, \ldots, x_k)$ be the limit of a convergent sequence of k-points $\{y^n\} = \{(y_1^n, y_2^n, \ldots, y_k^n)\}$ satisfying $y_i^n \in R$, $y_i^n \ne y_j^n$ if $i \ne j$, $n = 1, 2, \ldots$. If $x_i = x_j$ for some $i \ne j$ then $\liminf_n \sum_{i=1}^{k} p(S_i(y^n)) |y_i^n - u_i(y^n)| > 0$. Hence, if $\lim_{n \to \infty} \sum_{i=1}^{k} p(S_i(y^n)) |y_i^n - u_i(y^n)| = 0$, each member of the k-tuple $(x_1, x_2, \ldots, x_k)$ is distinct from the others.

We remark that if each member of the k-tuple $x = (x_1, x_2, \ldots, x_k)$ is distinct from the others, then $\pi(y) = (p(S_1(y)), p(S_2(y)), \ldots, p(S_k(y)))$, regarded as a mapping of $M$ onto $E_k$, is continuous at $x$ -- this follows directly from the absolute continuity of $p$. Similarly $u(y) = (u_1(y), u_2(y), \ldots, u_k(y))$, regarded as a mapping from $M$ onto $M$, is continuous at $x$ -- because of the absolute continuity of $p$ and the boundedness of $R$ (finiteness of $\int z \, dp(z)$ would do). Putting this remark together with Lemma 1, we get

Lemma 2. Let $x = (x_1, x_2, \ldots, x_k)$ be the limit of a convergent sequence of k-points $\{y^n\} = \{(y_1^n, y_2^n, \ldots, y_k^n)\}$ satisfying $y_i^n \in R$, $y_i^n \ne y_j^n$ if $i \ne j$, $n = 1, 2, \ldots$. If $\lim_{n \to \infty} \sum_{i=1}^{k} p(S_i(y^n)) |y_i^n - u_i(y^n)| = 0$ then $\sum_{i=1}^{k} p(S_i(x)) |x_i - u_i(x)| = 0$ and each point $x_i$ in the k-tuple $(x_1, x_2, \ldots, x_k)$ is distinct from the others.

Lemmas 1 and 2 above are primarily technical in nature. The heart of the proofs of Theorems 1 and 2 is the following application of Martingale theory:

Lemma 3. Let $t_1, t_2, \ldots$, and $\xi_1, \xi_2, \ldots$ be given sequences of random variables, and for each $n = 1, 2, \ldots$, let $t_n$ and $\xi_n$ be measurable with respect to $\beta_n$, where $\beta_1 \subset \beta_2 \subset \cdots$ is a monotone increasing sequence of $\sigma$-fields (belonging to the underlying probability space). Suppose each of the following conditions holds a.s.: (i) $|t_n| \le K < \infty$, (ii) $\xi_n \ge 0$, $\sum_n \xi_n < \infty$, (iii) $E(t_{n+1} \mid \beta_n) \le t_n + \xi_n$. Then the sequences of random variables $t_1, t_2, \ldots$ and $s_0, s_1, s_2, \ldots$, where $s_0 = 0$ and $s_n = \sum_{i=1}^{n} (t_i - E(t_{i+1} \mid \beta_i))$, $n = 1, 2, \ldots$, both converge a.s.

Proof. Let $y_n = t_n + s_{n-1}$ so that the $y_n$ form a Martingale sequence. Let $c$ be a positive number and consider the sequence $\{\tilde{y}_n\}$ obtained by stopping $y_n$ (see [2], p. 300) at the first $n$ for which $y_n \le -c$. From (iii) we see that $y_n \ge -\sum_{i=1}^{n-1} \xi_i - K$, and since $y_n - y_{n-1} \ge -2K$, we have $\tilde{y}_n \ge \max\left(-\sum_{i=1}^{\infty} \xi_i - K,\ -(c + 2K)\right)$. The sequence $\{\tilde{y}_n\}$ is a Martingale, so that $E\tilde{y}_n = E\tilde{y}_1$, $n = 1, 2, \ldots$, and being bounded from below with $E|\tilde{y}_1| \le K$, certainly $\sup_n E|\tilde{y}_n| < \infty$. The Martingale Theorem [2, p. 319] shows $\tilde{y}_n$ converges a.s. But $\tilde{y}_n = y_n$ for all $n$ on the set $A_c$ where $-\sum_{i=1}^{\infty} \xi_i - K > -c$, and (ii) implies $P[A_c] \to 1$ as $c \to \infty$. Thus $\{y_n\}$ converges a.s. This means $s_n = y_{n+1} - t_{n+1}$ is a.s. bounded. Using (iii) we can write $-s_n = \sum_{i=1}^{n} \xi_i - \sum_{i=1}^{n} \Delta_i$ where $\Delta_i \ge 0$. But since $s_n$ and $\sum_i \xi_i$ are a.s. bounded, $\sum_i \Delta_i$ converges a.s., $s_n$ converges a.s., and finally, so does $t_n$. This completes the proof.
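The opening claim of the proof can be checked in one line. The computation below is a sketch that takes $s_n = \sum_{i=1}^{n} (t_i - E(t_{i+1} \mid \beta_i))$ as the definition of $s_n$ and uses only that $t_n$ and $s_{n-1}$ are $\beta_n$-measurable. Since $s_n = s_{n-1} + t_n - E(t_{n+1} \mid \beta_n)$,

$$E(y_{n+1} \mid \beta_n) = E(t_{n+1} \mid \beta_n) + s_n = E(t_{n+1} \mid \beta_n) + s_{n-1} + t_n - E(t_{n+1} \mid \beta_n) = t_n + s_{n-1} = y_n.$$

The same bookkeeping gives $y_n - y_{n-1} = t_n - E(t_n \mid \beta_{n-1})$, which is at least $-2K$ by condition (i).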

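Turning now to the proof of Theorem 1, let $\omega_n$ stand for the sequence $z_1, z_2, \ldots, z_{n-1+k}$, and let $A_i^n$ be the event $[z_{n+k} \in S_i^n]$. Since $S^{n+1}$ is the minimum distance partition relative to $x^{n+1}$, we have

$$E[W(x^{n+1}) \mid \omega_n] = E\Big[\sum_{i=1}^{k} \int_{S_i^{n+1}} |z - x_i^{n+1}|^2 \, dp(z) \,\Big|\, \omega_n\Big] \le E\Big[\sum_{i=1}^{k} \int_{S_i^{n}} |z - x_i^{n+1}|^2 \, dp(z) \,\Big|\, \omega_n\Big] = \sum_{j=1}^{k} E\Big[\sum_{i=1}^{k} \int_{S_i^{n}} |z - x_i^{n+1}|^2 \, dp(z) \,\Big|\, A_j^n, \omega_n\Big] p_j^n. \tag{1}$$

If $z_{n+k} \in S_j^n$, $x_i^{n+1} = x_i^n$ for $i \ne j$. Thus we obtain

$$E[W(x^{n+1}) \mid \omega_n] \le W(x^n) - \sum_{j=1}^{k} \Big(\int_{S_j^n} |z - x_j^n|^2 \, dp(z)\Big) p_j^n + \sum_{j=1}^{k} E\Big[\int_{S_j^n} |z - x_j^{n+1}|^2 \, dp(z) \,\Big|\, A_j^n, \omega_n\Big] p_j^n. \tag{2}$$

Several applications of the relation $\int_A |z - x|^2 \, dp(z) = \int_A |z - u|^2 \, dp(z) + p(A) |x - u|^2$, where $\int_A (u - z) \, dp(z) = 0$, enable us to write the last term in (2) as

$$\sum_{j=1}^{k} \Big[\int_{S_j^n} |z - x_j^n|^2 \, dp(z) \, p_j^n - (p_j^n)^2 |x_j^n - u_j^n|^2 + (p_j^n)^2 |x_j^n - u_j^n|^2 \big(w_j^n/(w_j^n + 1)\big)^2 + \int_{S_j^n} |z - u_j^n|^2 \, dp(z) \, p_j^n/(w_j^n + 1)^2\Big].$$

Combining this with (2), we get

$$E[W(x^{n+1}) \mid \omega_n] \le W(x^n) - \sum_{j=1}^{k} |x_j^n - u_j^n|^2 (p_j^n)^2 (2 w_j^n + 1)/(w_j^n + 1)^2 + \sum_{j=1}^{k} \sigma_{n,j}^2 (p_j^n)^2/(w_j^n + 1)^2, \tag{3}$$

where $\sigma_{n,j}^2 = \int_{S_j^n} |z - u_j^n|^2 \, dp(z)/p_j^n$.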


Since we are assuming $p(R) = 1$, certainly $W(x^n)$ is a.s. bounded, as is $\sigma_{n,j}^2$. We now show that

$$\sum_{n} \big(p_j^n/(w_j^n + 1)\big)^2 \tag{4}$$

converges a.s. for each $j = 1, 2, \ldots, k$, thereby showing that $\sum_n \big(\sum_{j=1}^{k} \sigma_{n,j}^2 (p_j^n)^2/(w_j^n + 1)^2\big)$ converges a.s. Then Lemma 3 can be applied with $t_n = W(x^n)$ and $\xi_n = \sum_{j=1}^{k} \sigma_{n,j}^2 (p_j^n)^2/(w_j^n + 1)^2$.

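It suffices to prove that

$$\sum_{n \ge 2} (p_j^n)^2 / \big[(\beta + 1 + w_j^n)(\beta + 1 + w_j^{n+1})\big] \tag{5}$$

converges a.s. for any positive number $\beta$; also, this is convenient, for $E(I_j^n \mid \omega_n) = p_j^n$ where $I_j^n$ is the characteristic function of the event $[z_{n+k} \in S_j^n]$, and on noting that $w_j^{n+1} = 1 + \sum_{i=1}^{n} I_j^i$, a direct application of Theorem 1, p. 274, in [3], says that for any positive numbers $\alpha$ and $\beta$,

$$P\Big[\beta + w_j^{n+1} \ge 1 + \sum_{i=1}^{n} p_j^i - \alpha \sum_{i=1}^{n} v_j^i \ \text{ for all } n = 1, 2, \ldots\Big] \ge 1 - (1 + \alpha\beta)^{-1},$$

where $v_j^i = p_j^i - (p_j^i)^2$ is the conditional variance of $I_j^i$ given $\omega_i$. We take $\alpha = 1$, and thus with probability at least $1 - (1 + \beta)^{-1}$ the series (5) is dominated by

$$\sum_{n \ge 2} (p_j^n)^2 / \Big[\Big(1 + \sum_{i=1}^{n-1} (p_j^i)^2\Big)\Big(1 + \sum_{i=1}^{n} (p_j^i)^2\Big)\Big] = \sum_{n \ge 2} \Big[1/\Big(1 + \sum_{i=1}^{n-1} (p_j^i)^2\Big) - 1/\Big(1 + \sum_{i=1}^{n} (p_j^i)^2\Big)\Big],$$

which clearly converges.

The choice of $\beta$ being arbitrary, we have shown that (4) converges a.s. Application of Lemma 3 as indicated above proves $W(x^n)$ converges a.s.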
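The convergence of the dominating series rests on a telescoping identity: writing $S_n = \sum_{i=1}^{n} (p_j^i)^2$, each term equals $1/(1 + S_{n-1}) - 1/(1 + S_n)$, so the partial sum up to $m$ is $1/(1 + S_1) - 1/(1 + S_m) \le 1$. The sketch below checks this numerically for an arbitrary sequence of numbers in $[0, 1]$. The function name `dominating_series` and the random test sequence are assumptions for illustration, not quantities taken from the paper.

```python
import numpy as np

def dominating_series(p):
    """Partial sums of sum_{n>=2} p_n^2 / ((1 + S_{n-1}) (1 + S_n)), with S_n = p_1^2 + ... + p_n^2."""
    p = np.asarray(p, dtype=float)
    S = np.cumsum(p ** 2)
    terms = p[1:] ** 2 / ((1 + S[:-1]) * (1 + S[1:]))
    return np.cumsum(terms), S

rng = np.random.default_rng(0)
p = rng.random(1000)                 # any values in [0, 1] play the role of p_j^n
partial, S = dominating_series(p)
closed_form = 1 / (1 + S[0]) - 1 / (1 + S[1:])   # telescoped partial sums
```

The partial sums are increasing and bounded by $1/(1 + S_1)$ whatever the sequence is, which is why no further information about the $p_j^n$ is needed at this step.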


To identify the limit $W_\infty$, note that with $t_n$ and $\xi_n$ taken as above, Lemma 3 entails a.s. convergence of $\sum_n \big[W(x^n) - E[W(x^{n+1}) \mid \omega_n]\big]$, and hence (3) implies a.s. convergence of

$$\sum_{n} \Big(\sum_{j=1}^{k} |x_j^n - u_j^n|^2 (p_j^n)^2 (2 w_j^n + 1)/(w_j^n + 1)^2\Big). \tag{6}$$

Since (6) dominates $\sum_n \big(\sum_{j=1}^{k} p_j^n |x_j^n - u_j^n|\big)^2/kn$, the latter converges a.s., and a little consideration makes it clear that $\sum_{j=1}^{k} p_j^n |x_j^n - u_j^n| = \sum_{j=1}^{k} p(S_j(x^n)) |x_j^n - u_j(x^n)|$ converges to zero on a subsequence $\{x^{n_t}\}$ and that this subsequence has itself a convergent subsequence, say $\{x^{n_t'}\}$. Let $x = (x_1, x_2, \ldots, x_k) = \lim_{t \to \infty} x^{n_t'}$. Since $W(x) = V(x) + \sum_{j=1}^{k} p(S_j(x)) |x_j - u_j(x)|^2$ and in particular $W(x^n) = V(x^n) + \sum_{j=1}^{k} p(S_j(x^n)) |x_j^n - u_j(x^n)|^2$, we have only to show (a), $\lim_{t \to \infty} W(x^{n_t'}) = W_\infty = W(x)$, and (b), $\lim_{t \to \infty} \sum_{j=1}^{k} p(S_j(x^{n_t'})) |x_j^{n_t'} - u_j(x^{n_t'})| = 0 = \sum_{j=1}^{k} p(S_j(x)) |x_j - u_j(x)|$. Then $W(x) = V(x)$ and $x$ is a.s. unbiased. (Obviously $\sum_{i=1}^{k} p_i |a_i| = 0$ if and only if $\sum_{i=1}^{k} p_i |a_i|^2 = 0$, where $p_i \ge 0$.)

We show that (a) is true by establishing the continuity of $W(x)$. We have

$$W(x) \le \sum_{j=1}^{k} \int_{S_j(y)} |z - x_j|^2 \, dp(z) \le \sum_{j=1}^{k} \int_{S_j(y)} |z - y_j|^2 \, dp(z) + \sum_{j=1}^{k} \Big[p(S_j(y)) |x_j - y_j|^2 + 2 |x_j - y_j| \int_{S_j(y)} |z - y_j| \, dp(z)\Big],$$

with the last inequality following easily from the triangle inequality. Thus $W(x) \le W(y) + o(\rho(x, y))$, and similarly $W(y) \le W(x) + o(\rho(x, y))$.

To establish (b), Lemma 2 can be applied with $\{y^n\}$ and $\{x^{n_t'}\}$ identified, for a.s. $x_i^n \ne x_j^n$ for $i \ne j$, $n = 1, 2, \ldots$. It remains to remark that Lemma 2 also implies a.s. $x_i \ne x_j$ for $i \ne j$. The proof of Theorem 1 is complete.

Theorem 2 follows from the a.s. convergence of the series dominated by (6), upon applying an elementary result (c.f. Theorem C, p. 203 in [4]) which says that if $\sum_n a_n/n$ converges, then $\sum_{i=1}^{n} a_i/n \to 0$.
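The elementary result quoted here is easy to see numerically. The sketch below uses the illustrative choice $a_n = n^{-1/2}$ (an assumption, not a sequence from the paper): the series $\sum_n a_n/n = \sum_n n^{-3/2}$ converges, and the averages $\sum_{i=1}^{n} a_i/n$ shrink to zero at a rate of about $2/\sqrt{n}$.

```python
import numpy as np

n = np.arange(1, 100001)
a = 1 / np.sqrt(n)               # illustrative sequence a_n = n^(-1/2)

series = np.cumsum(a / n)        # partial sums of sum a_n / n
averages = np.cumsum(a) / n      # the averages (a_1 + ... + a_n) / n

print(series[-1], averages[-1])
```

By contrast, for the constant sequence $a_n = 1$ the series $\sum_n 1/n$ diverges and the averages stay at 1, so the hypothesis cannot simply be dropped.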

3. Remarks. In a number of cases covered by Theorem 1, all the unbiased k-points have the same value of $W$. In this situation, Theorem 1 implies $\sum_{i=1}^{k} p_i^n |x_i^n - u_i^n|$ converges a.s. to zero. An example is provided by the uniform distribution over a disk in $E_2$. If $k = 2$, the unbiased k-points $(x_1, x_2)$ with $x_1 \ne x_2$ consist of the family of points $x_1$ and $x_2$ opposite one another on a diameter, and at a certain fixed distance from the center of the disk. (There is one unbiased k-point with $x_1 = x_2$, both $x_1$ and $x_2$ being at the center of the disk in this case.) The k-means thus converge to some such relative position, but Theorem 1 does not quite permit us to eliminate the interesting possibility that the two means oscillate slowly but indefinitely around the center.
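The disk example can be explored with a short simulation. The sketch below assumes a disk of radius 1, for which the "certain fixed distance" is the centroid distance of a half disk, $4/(3\pi) \approx 0.424$. The function names `sample_disk` and `two_means_on_disk`, the sample size and the seed are illustrative assumptions.

```python
import numpy as np

def sample_disk(n, rng):
    """Draw n points uniformly from the unit disk."""
    r = np.sqrt(rng.random(n))
    theta = 2 * np.pi * rng.random(n)
    return np.column_stack((r * np.cos(theta), r * np.sin(theta)))

def two_means_on_disk(n, seed=0):
    """Run the sequential k-means update with k = 2 on n points after the two seeds."""
    rng = np.random.default_rng(seed)
    z = sample_disk(n + 2, rng)
    x = z[:2].copy()     # initial means are the first two points
    w = np.ones(2)       # initial weights
    for pt in z[2:]:
        i = int(np.argmin(np.sum((x - pt) ** 2, axis=1)))
        x[i] = (x[i] * w[i] + pt) / (w[i] + 1)
        w[i] += 1
    return x

x = two_means_on_disk(20000)
radii = np.linalg.norm(x, axis=1)   # distance of each mean from the center
target = 4 / (3 * np.pi)            # centroid distance of a half disk of radius 1
```

In runs of this kind the two means end up roughly opposite one another at a distance close to `target` from the center, while the direction of the diameter they settle on depends on the sample.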




Theorem 1 provides for a.s. convergence of $\sum_{i=1}^{k} p_i^n |x_i^n - u_i^n|$ to zero in a slightly broader class of situations: This is where the unbiased k-points $x = (x_1, x_2, \ldots, x_k)$ with $x_i \ne x_j$ for $i \ne j$ are all stable in the sense that for each such $x$, $W(y) \ge W(x)$ (and hence $V(y) \ge V(x)$) for all $y$ in a neighborhood of $x$. In this case, each such $x$ falls in one of finitely many equivalence classes such that $W$ is constant on each class. This is illustrated by the above example, where there is only a single equivalence class. If each of the equivalence classes contains only a single point, Theorem 1 implies a.s. convergence of $x^n$ to one of those points.

There are unbiased k-points which are not stable. Take a distribution on $E_2$ which has sharp peaks of probability at each corner of a square, and is symmetric about both diagonals. With $k = 2$, the two constituent points can be symmetrically located on a diagonal so that the boundary of the associated minimum distance partition coincides with the other diagonal. With some adjustment, such a k-point can be made to be unbiased, and if the probability is sufficiently concentrated at the corners of the square, any small movement of the two points off the diagonal in opposite directions results in a decrease in $W(x)$. It seems likely that the k-means cannot converge to such a configuration.
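The instability described here can be illustrated with a Monte Carlo estimate of $W$. The sketch below idealizes the "sharp peaks" as small uniform squares of half-width 0.05 centered at the corners $(\pm 1, \pm 1)$. That choice, the perturbation size `eps`, and the names `sample_corner_peaks` and `W_hat` are assumptions made for illustration. The diagonal configuration is made approximately unbiased by placing $x_1$ at the sample mean of the half plane nearer to it, projected onto the diagonal.

```python
import numpy as np

def sample_corner_peaks(n, half_width, rng):
    """Draw n points from small uniform squares centred at the corners (+-1, +-1)."""
    corners = np.array([[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]])
    centres = corners[rng.integers(0, 4, size=n)]
    return centres + rng.uniform(-half_width, half_width, size=(n, 2))

def W_hat(x, z):
    """Monte Carlo estimate of W(x): mean squared distance to the nearest x_i."""
    d2 = np.sum((z[:, None, :] - x[None, :, :]) ** 2, axis=2)
    return d2.min(axis=1).mean()

rng = np.random.default_rng(0)
z = sample_corner_peaks(40000, 0.05, rng)

# Approximately unbiased 2-point on the diagonal; its partition boundary is the other diagonal.
u1 = z[z[:, 0] + z[:, 1] > 0].mean(axis=0)
a = u1.mean()
x_diag = np.array([[a, a], [-a, -a]])

# Move the two points off the diagonal in opposite directions.
eps = 0.1
shift = eps * np.array([1.0, -1.0])
x_moved = np.array([x_diag[0] + shift, x_diag[1] - shift])

W_diag, W_moved = W_hat(x_diag, z), W_hat(x_moved, z)
```

In the limit of point masses at the corners, a direct calculation gives $W = 1.5$ on the diagonal with $a = 1/2$ and $W = 1.5 - 2\varepsilon + 2\varepsilon^2$ after the move, so the estimate `W_moved` should come out below `W_diag`.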
$x_j^{n+1} = x_j^n$, $w_j^{n+1} = w_j^n$, $j \ne i$.

The asymptotic behavior of the k-means is studied and it is shown that $\sum_{i=1}^{k} \int_{S_i^n} |z - x_i^n|^2 \, dp(z)$ converges a.s., where $S_i^n$ is the region in $E_N$ nearer to $x_i^n$ than $x_j^n$, $j \ne i$, and $p$ is the common probability measure of the $y_n$.

Applications of the k-means concept occur in statistical analysis of N-dimensional data, in coding problems, and in the description of human